Maximal Match Chinese Segmentation Augmented by Resources Generated from a Very Large Dictionary for Post-Processing

Authors

  • Ka-Po Chow
  • Andy C. Chin
  • Wing Fu Tsoi
Abstract

We used a production segmentation system that draws heavily on a large dictionary derived from processing a large amount (over 150 million Chinese characters) of synchronous textual data gathered from various Chinese speech communities, including Beijing, Hong Kong, Taipei, and others. We ran this system in two tracks of the Second International Chinese Word Segmentation Bakeoff, with Backward Maximal Matching (right-to-left) as the primary mechanism. We also explored a number of supplementary features offered by the large dictionary in postprocessing, in an attempt to resolve ambiguities and detect unknown words. While the results might not have reached their fullest potential, they nevertheless reinforced the importance and usefulness of a large dictionary as a basis for segmentation, and the implications of following a uniform standard for segmentation performance on data from various sources.
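
As an illustration of the Backward Maximal Matching mechanism named in the abstract, the following is a minimal Python sketch and not the authors' production system. The function name `backward_maximal_match`, the `max_len` parameter, the toy lexicon, and the single-character fallback for unmatched text are all assumptions made for this example. The dictionary-based postprocessing for ambiguity resolution and unknown-word detection is not modeled.

```python
def backward_maximal_match(text, lexicon, max_len=None):
    """Segment `text` right-to-left, taking the longest lexicon word each time.

    Hypothetical helper for illustration only. `lexicon` is any container
    supporting `in` (a set is assumed here). Characters that match no
    lexicon entry are emitted as single-character tokens.
    """
    if max_len is None:
        # Longest word in the lexicon bounds the candidate window.
        max_len = max((len(word) for word in lexicon), default=1)

    tokens = []
    end = len(text)
    while end > 0:
        # Start with the longest candidate that ends at `end`.
        start = max(0, end - max_len)
        # Shrink the candidate from the left until it is a known word
        # or only one character remains.
        while start < end - 1 and text[start:end] not in lexicon:
            start += 1
        tokens.append(text[start:end])
        end = start

    # Tokens were collected from right to left, so restore reading order.
    tokens.reverse()
    return tokens


# Toy lexicon (an assumption, not taken from the paper's dictionary).
toy_lexicon = {"研究", "研究生", "生命", "命", "起源"}
print(backward_maximal_match("研究生命起源", toy_lexicon))
```

With this toy lexicon, scanning from the right yields 研究 / 生命 / 起源, whereas a left-to-right longest match would first take 研究生 and leave 命 stranded. This is one common motivation for the backward direction, though neither direction is correct in every case.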

Similar resources

Ambiguity Resolution in Chinese Word Segmentation

A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper. The key characteristics of this model are the use of: (i) statistics which can be automatically ...

Discovering Chinese Words from Unsegmented Text

In English written text, words are separated by spaces, but in written Chinese text, there are no such separators between words. (See Figure 1.) Thus, effective information retrieval of Chinese text first requires good word segmentation. In this paper, we investigate an efficient algorithm to discover the words and their occurrence probabilities from a corpus of unsegmented text without using a...

Chinese Textual Sentiment Analysis: Datasets, Resources and Tools

The rapid accumulation of data in social media (at million and billion scales) has imposed great challenges in information extraction, knowledge discovery, and data mining, and texts bearing sentiment and opinions are one of the major categories of user-generated data in social media. Sentiment analysis is the main technology to quickly capture what people think from these text data, and is a r...

Chinese Word Segmentation Using Various Dictionaries

Most Chinese word segmentation systems use a monolingual dictionary and are designed for monolingual processing. For the tasks of machine translation (MT) and cross-language information retrieval (CLIR), a separate translation dictionary may be used to transfer the words of documents from the source languages to target languages. The inconsistencies resulting from the two types of dictionari...

ISCAS: A Cascaded Approach for CIPS-SIGHAN Micro-Blog Word Segmentation Bakeoff 2012 Track

State-of-the-art Chinese word segmentation systems have achieved high performance on well-formed long documents. However, segmenting microblog text is difficult because of noise and out-of-vocabulary (OOV) words. In this paper, we present a Chinese Micro-Blog Segmentation system for the CIPS-SIGHAN Word Segmentation Bakeoff 2012 track. The proposed system adopts a cascaded approach which conta...

Publication date: 2005